Abstract.

The problem of clustering individuals is considered within the
context of a mixture of distributions. A modification of the usual
approach to population mixtures is employed. As usual, a parametric
family of distributions is considered, a set of parameter values being
associated with each population. In addition, with each observation is
associated an identification parameter, indicating from which population
the observation arose. The resulting likelihood function is interpreted
in terms of the conditional probability density of a sample from a mixture
of populations, given the identification parameter of each observation.
Clustering algorithms are obtained by applying a method of iterated
maximum likelihood to this likelihood function.

AMS 1970 subject classification. 62H30; Secondary 62E10.

Key  words  and  phrases.  Mixture  of  distributions,  cluster  analysis,  isodata 
procedure,  k-means  procedure,  Mahalanobis  distance. 
Summary.

The problem of clustering individuals is considered within the context
of a mixture of distributions. A modification of the usual approach to
population mixtures is employed. As usual, a parametric family of
distributions is considered, a set of parameter values being associated with each
population. In addition, with each observation is associated a parameter
indicating from which population the observation arose. The resulting
likelihood function is interpreted as the conditional probability density
of a sample from the mixture of populations, given the population
identifications of each observation.

The relation of this conditional mixture model to the standard mixture
model is discussed; it is shown how the concept of the conditional mixture
model provides a probability model for cluster analysis, and it is shown
how to use the model to provide a plausible general method for clustering.

Given a parametric family of distributions, an appropriate clustering
algorithm is obtained by applying a method of iterated maximum likelihood
to the resulting likelihood function. The algorithms resulting by application
of this general method are, then, interpretable as schemes for estimating the
parameters of probability models.

Special attention is given to the case of multivariate normal
populations with common covariance matrix. This case is of special interest
because application of the general method produces Mahalanobis-distance
versions of two well-known clustering algorithms, isodata and k-means,
thereby relating these algorithms to a probability model for the clustering
problem. Other models given special attention are the multivariate normal
distribution with different covariance matrices, and multinomial models,
especially the model based on an assumption of local independence as used
in latent structure analysis.

1.  Introduction. 

The problem of "clustering" to be considered here is as follows:
given a sample of p-vectors $x_1, x_2, \ldots, x_n$, that is, a sample of p
observations on each of n individuals, put the individuals into groups.
Of course the problem needs more formalization if we are to be able to
do anything meaningful with it.

We begin by defining a clustering as a partition of the set of
observations, that is, a collection $\{C_1, C_2, \ldots, C_k\}$ of disjoint sets
such that each observation belongs to one and only one set $C_g$. Each set
$C_g$ ($g = 1, \ldots, k$) is a cluster.

In this paper we shall assume that the integer k is specified in
advance. (A modification of the algorithm to be presented allows some of
the clusters to join or split, thereby permitting fewer or more than k
clusters to be formed. See Section 6.2.)

As has been suggested before [see, e.g., Fleiss and Zubin (1969)],
it seems reasonable to consider a population mixture model for clustering
problems. With the g-th population is associated the probability density
function $h_g(x)$, $g = 1, \ldots, k$. When we are working with some parametric
family, say indexed by a parameter $\beta$, $h_g$ takes the form
$h_g(x) = h(x; \beta_g)$.

The densities (or parameters) are unknown, this being the distinction
between the present formulation of the clustering problem and the classical
classification problem, sometimes termed "identification", "discrimination",
or "allocation". In the classical problem, the densities or parameters are
known, or else a training set of data is available, from which the densities
or parameters can be estimated.

Now with Individual i ($i = 1, \ldots, n$) associate the group identification
parameter $\gamma_i$, which is equal to g if and only if Individual i belongs
to group g ($g = 1, 2, \ldots, k$). Each individual gives rise to a pair
$(X, \gamma)$. $X$ is observable; $\gamma$ is not. It will thus be seen that this
problem fits into the framework of an empirical Bayes problem [see, e.g.,
Robbins (1964)], but in the present paper this aspect will not be studied
explicitly.

In the terminology used by Neyman and Scott (1948) in a study of
consistent estimation, the parameters $\gamma_i$ are "incidental" parameters because
each of them refers to a finite number of observations (one in the present
case), while the parameters $\beta_g$ are "structural" parameters because, if
we allow n to tend to infinity, each of them is associated with an infinite
number of observations.

In the context of this model, to "cluster" is merely to estimate
the $\gamma_i$'s, $i = 1, \ldots, n$.

It is convenient to reparametrize somewhat. Replace $\gamma_i$ by the
k-vector $\theta_i$ which consists of k-1 zeros and a single 1, the position
of the 1 indicating which group Individual i belongs to; that is,
$\theta_i$ has a 1 as its $\gamma_i$-th element and 0's elsewhere. The density of
$X_i$, given $\theta_i$, is

$$ f(x_i \mid \theta_i) = \sum_{g=1}^{k} \theta_{gi} \, h_g(x_i), \tag{1.1} $$

where $\theta_{gi}$ is the g-th component of $\theta_i$.


2.  The  probability  model. 

This model should be compared and contrasted with the usual population
mixture model, in which any observation $x_i$ is chosen from Population g
with probability $\pi_g$, so that the density of $X_i$ is

$$ \sum_{g=1}^{k} \pi_g \, h_g(x_i). \tag{2.1} $$

The probability model that will be used here for the clustering problem
is as follows. It is assumed that pairs $(X_i, \Theta_i)$, $i = 1, \ldots, n$, have been
sampled randomly, in the sense that their joint density is

$$ \prod_{i=1}^{n} f_{X_i, \Theta_i}(x_i, \theta_i). \tag{2.2} $$


(The notation which will be used here is the standard notation in
which f and F are generic symbols for probability density functions
and cumulative distribution functions, respectively; $f_X$ denotes the
probability density function of the random variable X; $f_{X,Y}$ denotes
the joint density of X and Y; $f_{Y \mid X}$ denotes the conditional probability
density function of Y, given X; etc. For the moment we suppress the
subscript i.)

The conditional density of X given $\Theta$ is

$$ \sum_{g=1}^{k} \theta_g \, h_g(x). $$

The marginal density of $\Theta$ is taken to be the point multinomial,

$$ f_\Theta(\theta) = \pi_1^{\theta_1} \pi_2^{\theta_2} \cdots \pi_k^{\theta_k}, $$

$\theta_g = 0$ or 1, $\sum_{g=1}^{k} \theta_g = 1$, $\pi_g > 0$, $\sum_{g=1}^{k} \pi_g = 1$. Here $\pi_g$
is the probability that a randomly selected individual comes from
Population g.

First it will be shown that the standard mixture density is indeed
the marginal density for X resulting from this model. Somewhat more
generally, let $Z = (Z_1, Z_2, \ldots, Z_k)$ be a random vector. If the conditional
density of X given Z is

$$ f_{X \mid Z}(x \mid z) = \sum_{g=1}^{k} z_g \, h_g(x), $$

then the marginal density of X is

$$ f_X(x) = \sum_{g=1}^{k} E[Z_g] \, h_g(x). $$

To see this, note that we have

$$
\begin{aligned}
f_X(x) &= \int f_{X,Z}(x, z) \, dz \\
       &= \int f_{X \mid Z}(x \mid z) \, f_Z(z) \, dz \\
       &= \int \sum_{g=1}^{k} z_g \, h_g(x) \, f_Z(z) \, dz \\
       &= \sum_{g=1}^{k} \Big[ \int z_g \, f_Z(z) \, dz \Big] h_g(x) \\
       &= \sum_{g=1}^{k} E[Z_g] \, h_g(x).
\end{aligned}
$$


From this it follows that if $\Theta = (\Theta_1, \ldots, \Theta_k)$ has the point
multinomial density

$$ f_\Theta(\theta) = \pi_1^{\theta_1} \pi_2^{\theta_2} \cdots \pi_k^{\theta_k}, $$

$\pi_g > 0$, $\sum_{g=1}^{k} \pi_g = 1$, $\theta_g = 0$ or 1, $\sum_{g=1}^{k} \theta_g = 1$, then the marginal density
of X is the standard mixture density

$$ f_X(x) = \sum_{g=1}^{k} \pi_g \, h_g(x). $$

For, under this model the random variable $Z_g = \Theta_g$ is Bernoulli with
parameter $\pi_g$; hence $E[\Theta_g] = \pi_g$.
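
As a numerical check of this marginalization, the following Python sketch sums the conditional density over the k possible indicator vectors, weighting each by its point multinomial probability, and compares the result with the mixture density. The univariate normal components, the mixing proportions and all function names are illustrative assumptions, not part of the model above.

```python
import math

def normal_pdf(x, mean, sd):
    """Univariate normal density, used here as an illustrative h_g."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def conditional_density(x, theta, components):
    """f(x | theta) = sum_g theta_g * h_g(x) for an indicator vector theta."""
    return sum(t * normal_pdf(x, m, s) for t, (m, s) in zip(theta, components))

def marginal_density(x, pi, components):
    """Sum f(x | theta) * f(theta) over the k indicator vectors theta."""
    k = len(pi)
    total = 0.0
    for g in range(k):
        theta = [1 if j == g else 0 for j in range(k)]
        # For the indicator vector with its 1 in position g, f(theta) = pi_g.
        total += pi[g] * conditional_density(x, theta, components)
    return total

def mixture_density(x, pi, components):
    """Standard mixture density sum_g pi_g * h_g(x)."""
    return sum(p * normal_pdf(x, m, s) for p, (m, s) in zip(pi, components))

pi = [0.3, 0.7]                        # hypothetical mixing proportions
components = [(0.0, 1.0), (4.0, 2.0)]  # hypothetical (mean, sd) pairs
```

For any x, `marginal_density(x, pi, components)` and `mixture_density(x, pi, components)` agree up to rounding, as the derivation requires.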

Now suppose that pairs $(X_i, \Theta_i)$, $i = 1, \ldots, n$, are sampled randomly,
in the sense that their joint density is (2.2). Of course, then the
X's are independent, and the $\Theta$'s are independent. Then the conditional
density of $X_1, X_2, \ldots, X_n$, given $\Theta_1, \Theta_2, \ldots, \Theta_n$, is

$$
\begin{aligned}
f_{X_1, \ldots, X_n \mid \Theta_1, \ldots, \Theta_n}(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_n)
  &= \frac{f_{X_1, \Theta_1, \ldots, X_n, \Theta_n}(x_1, \theta_1, \ldots, x_n, \theta_n)}
          {f_{\Theta_1, \ldots, \Theta_n}(\theta_1, \ldots, \theta_n)} \\
  &= \frac{\prod_{i=1}^{n} f_{X_i, \Theta_i}(x_i, \theta_i)}
          {\prod_{i=1}^{n} f_{\Theta_i}(\theta_i)} \\
  &= \frac{\prod_{i=1}^{n} f_{X_i \mid \Theta_i}(x_i \mid \theta_i) \, f_{\Theta_i}(\theta_i)}
          {\prod_{i=1}^{n} f_{\Theta_i}(\theta_i)} \\
  &= \prod_{i=1}^{n} f_{X_i \mid \Theta_i}(x_i \mid \theta_i).
\end{aligned}
\tag{2.3}
$$

It is (2.3) which is the "likelihood" in the conditional population
mixture model. In the context of this model, then, to "cluster" is to
estimate the $\theta_i$'s, the values of the identification parameters.

Versions of this model have been used recently by Scott and Symons
(1971) and S. John (1970), but the model dates back at least to Gibson
(1959), where it is called the latent profile model. This model has
been discussed by Anderson (1959).

The likelihood approach to clustering is illuminating in that it
sometimes shows how ad hoc optimality criteria (objective functions)
which have been proposed for the clustering problem relate to particular
probability models. For example, Scott and Symons (1971) show how various
optimality criteria relate to maximum likelihood clustering in
multivariate normal populations.

Note that we can equivalently write (1.1) as a product:

$$ f(x_i \mid \theta_i) = \prod_{g=1}^{k} [h_g(x_i)]^{\theta_{gi}}. \tag{2.4} $$

The form (2.4) is often more convenient, and we shall use it in what
follows.
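
The equivalence of the sum form (1.1) and the product form (2.4) rests only on $\theta_i$ being an indicator vector. A minimal Python sketch, with hypothetical density values `h_values` standing in for $h_1(x_i), \ldots, h_k(x_i)$, makes this concrete:

```python
def density_sum_form(theta, h_values):
    """Form (1.1): sum over g of theta_g * h_g(x)."""
    return sum(t * h for t, h in zip(theta, h_values))

def density_product_form(theta, h_values):
    """Form (2.4): product over g of h_g(x) raised to the power theta_g."""
    result = 1.0
    for t, h in zip(theta, h_values):
        result *= h ** t   # factors with theta_g = 0 contribute 1
    return result

h_values = [0.05, 0.40, 0.10]   # hypothetical values of h_g at one observation
indicators = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

For each vector in `indicators` the two functions return the same value, namely the density of the group that the indicator selects.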

It is easy to allow for the presence of a "training set" of data --
a prior set of observations for each of which we know the group
identification. Letting $m_g$ be the number of prior observations in the g-th
group and denoting the prior observations from the g-th group by $w_{gt}$,
$t = 1, \ldots, m_g$, we can write the likelihood as

$$ \prod_{g=1}^{k} \prod_{t=1}^{m_g} h_g(w_{gt}) \; \prod_{i=1}^{n} \prod_{g=1}^{k} [h_g(x_i)]^{\theta_{gi}} $$

if we treat all the observations $W_{gt}$, $g = 1, \ldots, k$, $t = 1, \ldots, m_g$, and
$X_i$, $i = 1, \ldots, n$, as statistically independent. We do not explicitly
treat the case of prior observations any further here.


3. The clustering algorithm.

Using the form (2.4), one sees that under the random sampling
mechanism mentioned above the joint probability density function of
$X_1, X_2, \ldots, X_n$, given $\theta_1, \theta_2, \ldots, \theta_n$, is

$$ \prod_{i=1}^{n} \prod_{g=1}^{k} [h_g(x_i)]^{\theta_{gi}}, $$

or, in parametric form,

$$ \prod_{i=1}^{n} \prod_{g=1}^{k} [h(x_i; \beta_g)]^{\theta_{gi}}. $$

The likelihood is to be maximized over all assignments of individuals
to groups and over all permissible parameter values. Many ad hoc schemes
can be applied to this maximization problem. For example, one way to
maximize is to start with a given clustering $C_1, \ldots, C_k$, take each
observation successively and shift it to the first cluster for which a
shift results in an increase in likelihood, and loop through the data
until no individual changes clusters.

The algorithm to be described here is an iterated, that is, a
back-and-forth procedure of maximizing this likelihood function, in that we first
maximize with respect to the $\theta$'s (holding the $\beta$'s fixed at initial
values); then we maximize with respect to the $\beta$'s (holding the $\theta$'s
fixed at the values obtained in the previous stage); then we again
maximize with respect to the $\theta$'s (holding the $\beta$'s fixed at the values
obtained in the previous stage); etc. We stop when no $\theta$ changes, i.e.,
when no individual changes clusters -- or when we have used a
prespecified amount of computer time.

An alternative for starting the procedure is to start with an initial
clustering rather than with initial guesses of the $\beta$'s.

It is clear that, for fixed values of the $\beta$'s, say $\hat\beta$'s, the
likelihood is maximized, for each i, by taking

$$
\hat\theta_{gi} =
\begin{cases}
1 & \text{if } h(x_i; \hat\beta_g) = \max_{1 \le \ell \le k} h(x_i; \hat\beta_\ell), \\
0 & \text{otherwise.}
\end{cases}
\tag{3.1}
$$

(In case of ties an arbitrary choice is made.) In other words, clustering
proceeds by allocating Individual i to that group for which the
estimated probability density of the observation $x_i$ is largest.

Note that, having tentatively estimated the $\gamma$'s (or, equivalently,
the $\theta$'s) at any stage, that is, having tentatively clustered the
individuals, estimation of the $\beta$'s is reduced simply to ordinary
maximum likelihood estimation in the particular parametric family at
hand.
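
The back-and-forth procedure can be sketched generically in Python as follows. This is only an illustration under stated assumptions: the caller supplies a hypothetical `log_density(x, beta)` for the parametric family and a hypothetical `fit(members)` returning the ordinary maximum likelihood estimate from one cluster; a cluster that becomes empty simply keeps its previous parameter value, a convention chosen here and not taken from the text.

```python
def assign(xs, betas, log_density):
    """Step (3.1): give each observation the label of the group whose
    estimated density is largest (ties go to the lowest index)."""
    k = len(betas)
    return [max(range(k), key=lambda g: log_density(x, betas[g])) for x in xs]

def iterate_ml(xs, initial_betas, log_density, fit, max_iter=100):
    """Alternate between re-assigning observations and re-estimating betas."""
    betas = list(initial_betas)
    labels = None
    for _ in range(max_iter):
        new_labels = assign(xs, betas, log_density)
        if new_labels == labels:      # no individual changed clusters: stop
            break
        labels = new_labels
        for g in range(len(betas)):
            members = [x for x, lab in zip(xs, labels) if lab == g]
            if members:               # assumption: empty clusters keep old beta
                betas[g] = fit(members)
    return labels, betas

# Illustration: univariate normal populations with unit variance, so that
# beta_g is the mean and its maximum likelihood estimate is the sample mean.
xs = [0.0, 0.2, 0.1, 5.0, 5.2, 4.9]
labels, betas = iterate_ml(
    xs, [0.0, 1.0],
    log_density=lambda x, mu: -0.5 * (x - mu) ** 2,
    fit=lambda members: sum(members) / len(members),
)
```

With these data the first three observations end in one cluster and the last three in the other, and the procedure stops at the second pass because no label changes.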

Let T denote the set of $\theta_i$'s and B the set of $\beta_g$'s. Write
L(B,T) to denote the likelihood. Let $B^{(s)}$ denote the value of B
which maximizes L at the s-th stage of the iteration, and similarly
let $T^{(s)}$ denote the value of T which maximizes L at the s-th stage
of the iteration. Then $T^{(s)}$ maximizes $L(B^{(s)}, T)$ with respect to T,
and $B^{(s)}$ maximizes $L(B, T^{(s-1)})$ with respect to B. As a function of
B, $L(B, T^{(s-1)})$ is the section of L(B,T) at $T = T^{(s-1)}$, and $L(B^{(s)}, T)$
as a function of T is the section of L(B,T) at $B = B^{(s)}$. We may refer
to this back-and-forth maximization as section-wise maximization. It is
an example of the relaxation method (or "Southwell's method"); see Ortega
and Rheinboldt (1970, pp. 214ff.) and Southwell (1940 and 1946).

It is true that

$$ L(B^{(s+1)}, T^{(s)}) \ge L(B^{(s)}, T^{(s)}) $$

and

$$ L(B^{(s)}, T^{(s+1)}) \ge L(B^{(s)}, T^{(s)}), $$

that is, at no stage of the procedure can the value of the likelihood
be decreased; however, there is no guarantee of convergence to the
global maximum (neither do alternative clustering algorithms guarantee
convergence to the global maximum of their objective functions).

To see how the procedure can fail to converge to a global maximum,
suppose it happens that $L(B^{(s)}, T^{(s)}) \ge L(B, T^{(s)})$ for all B, or
$L(B^{(s)}, T^{(s)}) \ge L(B^{(s)}, T)$ for all T. Then the procedure will terminate
at the s-th stage, without having necessarily reached a global maximum.
That is, if, having maximized with respect to one of the variables B
and T, we happen to find ourselves at a (relative) maximum with respect
to the other, we may not reach a global maximum.

Back-and-forth iterative methods such as the one developed here
are familiar in other estimation problems, notably in weighted least
squares estimation, where we iterate between estimating the weights and
the regression coefficients, and in factor analysis, where we iterate
between estimating the communalities and the factor loadings.

4.  Application  to  particular  distributions. 

Now  we  consider  application  of  this  general  clustering  method  to 
particular families of distributions. First we consider normal
distributions with common covariance matrix, for it is in this case that it
becomes clear how the model establishes a link with some existing
clustering procedures.


4.1. Multivariate normal populations with common covariance matrix.

In the case of p-variate normal populations with means $\mu_g$,
$g = 1, \ldots, k$, and common covariance matrix $\Sigma$, the likelihood takes this
form:

$$ (2\pi)^{-np/2} \, |\Sigma|^{-n/2} \exp\Big[ -\tfrac{1}{2} \sum_{i=1}^{n} \sum_{g=1}^{k} \theta_{gi} \, (x_i - \mu_g)' \, \Sigma^{-1} (x_i - \mu_g) \Big]. $$

Here (3.1) is equivalent to

$$
\hat\theta_{gi} =
\begin{cases}
1 & \text{if } (x_i - \hat\mu_g)' \, \hat\Sigma^{-1} (x_i - \hat\mu_g)
      = \min_{1 \le \ell \le k} \{ (x_i - \hat\mu_\ell)' \, \hat\Sigma^{-1} (x_i - \hat\mu_\ell) \}, \\
0 & \text{otherwise.}
\end{cases}
\tag{4.1}
$$


That is, Individual i is assigned to that group to whose tentatively
estimated centroid he is closest, where the distance is in the metric
of the tentatively estimated covariance matrix. Having estimated the
$\theta$'s, we have multivariate normal observations arranged into groups;
maximization with respect to the $\mu$'s and $\Sigma$ is accomplished by taking
the sample mean vectors as estimates for the $\mu$'s, and the within-groups
sum-of-products matrix (divided by n) gives the estimate of $\Sigma$. The
procedure is iterated: using new estimates $\hat\mu_g$, $g = 1, \ldots, k$, and
$\hat\Sigma$, (4.1) is applied again. Then new $\hat\mu$'s and a new $\hat\Sigma$ are
calculated, etc. The matrix $\hat\Sigma$ can be updated efficiently. Also, the
Mahalanobis distances in (4.1) can be efficiently computed as follows.
These distances are of the form $v' M^{-1} v$, where $v = (x_i - \hat\mu)$ and
$M = \hat\Sigma$. To evaluate a quadratic form $v' M^{-1} v$, given M and v, one
notes that, algebraically, the solution x of the system $Mx = v$ is
$x = M^{-1} v$. Numerically, this solution x can be obtained efficiently,
without doing all the arithmetic operations required to obtain $M^{-1}$.
One then computes the value of $v' M^{-1} v$ simply as $v'x$.
[See Anderson (1958), p. 107.]
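
The device of solving $Mx = v$ instead of inverting M can be sketched in plain Python. The use of a Cholesky factorization is an assumption made here for illustration (it applies because a nonsingular covariance estimate is symmetric positive definite); the text does not prescribe a particular solver, and the function names are hypothetical.

```python
import math

def cholesky(m):
    """Lower-triangular factor of a symmetric positive definite matrix m,
    so that m equals the factor times its transpose."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][t] * low[j][t] for t in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low

def solve_spd(m, v):
    """Solve m x = v by forward and back substitution on the Cholesky factor."""
    low = cholesky(m)
    n = len(v)
    y = [0.0] * n
    for i in range(n):                      # forward substitution
        y[i] = (v[i] - sum(low[i][t] * y[t] for t in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):            # back substitution
        x[i] = (y[i] - sum(low[t][i] * x[t] for t in range(i + 1, n))) / low[i][i]
    return x

def quadratic_form(m, v):
    """Evaluate v' m^{-1} v as v'x, where x solves m x = v."""
    x = solve_spd(m, v)
    return sum(vi * xi for vi, xi in zip(v, x))
```

For example, with M having rows (2, 1) and (1, 2) and v = (1, 1), the solution of $Mx = v$ is x = (1/3, 1/3), so the quadratic form is 2/3; no inverse matrix is ever formed.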


Relationship with the "isodata" procedure. This scheme is a
Mahalanobis-distance version of Ball and Hall's (1967) isodata clustering
procedure. (Earlier documentation of isodata by Ball and Hall exists,
but the 1967 reference is perhaps the most accessible.) The isodata scheme
proceeds as follows. One starts with tentative estimates of cluster
means and assigns each individual to the mean to which he is closest.
(The isodata scheme uses Euclidean distance, or modified Euclidean
distance in which different weights are assigned to the p dimensions.)
The cluster means are then re-estimated, and one loops through the data
again, reassigning the individuals, etc. Note the similarity to our
scheme. We start with tentative estimates of the $\mu$'s and $\Sigma$ (it
seems a good idea to take the initial estimates of the $\mu$'s to be
outside the convex hull of the data, and it is easy to take the initial
estimate of $\Sigma$ to be the identity matrix), and assign each individual to
the mean to which he is closest, using Mahalanobis-distance in the
metric of the tentatively estimated covariance matrix. The $\mu$'s and
$\Sigma$ are then re-estimated, the individuals are re-allocated to clusters,
etc.
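
The scheme just described can be sketched in plain Python as follows. This is a minimal illustration, not the APL program referred to in this paper: the Gaussian-elimination solver, the function names, the choice to leave an emptied cluster's mean unchanged, and the toy data are all assumptions, and no safeguard against a singular covariance estimate is included.

```python
def solve(m, v):
    """Solve m x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    a = [list(row) + [v[i]] for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for j in range(c, n + 1):
                a[r][j] -= f * a[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))) / a[i][i]
    return x

def mahalanobis_sq(x, mu, cov):
    """(x - mu)' cov^{-1} (x - mu), computed through a linear solve."""
    v = [xi - mi for xi, mi in zip(x, mu)]
    return sum(vi * si for vi, si in zip(v, solve(cov, v)))

def cluster_common_cov(xs, initial_means, max_iter=50):
    n, p = len(xs), len(xs[0])
    mus = [list(m) for m in initial_means]
    cov = [[float(i == j) for j in range(p)] for i in range(p)]  # start at I
    labels = None
    for _ in range(max_iter):
        new = [min(range(len(mus)), key=lambda g: mahalanobis_sq(x, mus[g], cov))
               for x in xs]
        if new == labels:                     # no individual changed clusters
            break
        labels = new
        for g in range(len(mus)):             # sample mean vectors
            members = [x for x, lab in zip(xs, labels) if lab == g]
            if members:                       # assumption: keep old mean if empty
                mus[g] = [sum(col) / len(members) for col in zip(*members)]
        cov = [[0.0] * p for _ in range(p)]   # within-groups sum of products / n
        for x, lab in zip(xs, labels):
            d = [xi - mi for xi, mi in zip(x, mus[lab])]
            for i in range(p):
                for j in range(p):
                    cov[i][j] += d[i] * d[j] / n
    return labels, mus, cov

# Hypothetical toy data: two elongated groups separated in the second coordinate.
xs = [(0.0, 0.0), (1.0, 0.1), (2.0, -0.1), (3.0, 0.0),
      (0.0, 2.0), (1.0, 2.1), (2.0, 1.9), (3.0, 2.0)]
labels, mus, cov = cluster_common_cov(xs, [(0.0, 0.0), (0.0, 2.0)])
```

On these data the first pass uses Euclidean distance (since the starting matrix is the identity) and later passes use the re-estimated within-groups matrix; the two groups are recovered and the iteration stops once the labels repeat.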

An important difference is that our scheme employs
Mahalanobis-distance rather than Euclidean or weighted-Euclidean distance. And it
is worth emphasizing that it is the Mahalanobis distance based on the
within-groups sum-of-products matrix that arises here; some data
analysts use the total sum-of-products matrix, which, as Chernoff (1970),
for example, has argued, is not appropriate. I have done data analyses
using both the total and the within-groups sum-of-products matrices,
and the total sum-of-products matrix gave poor results, while the
within-groups sum-of-products matrix gave good results.


For example, consider the Fisher iris data [Fisher (1936)],
consisting of p=4 measurements on each of 50 irises in each of k=3
species. If the sample centroids of the three species are computed from the
group-identified data and the 150 flowers are then assigned to that centroid to which
they are "closest", then only three misclassifications are made when
the distance is in the metric of the within-groups covariance matrix, 11
misclassifications are made if Euclidean distance is used, and 20
misclassifications are made when the distance is in the metric of the total
covariance matrix!

One further point along these lines: Mahalanobis-distance is the
same as Euclidean distance in terms of principal axes. Hence some data
analysts transform the raw data into scores on principal components, so
that they can simply use Euclidean distance. Their mistake is that they
use the principal components of the total sum-of-products matrix. The
Euclidean distance they calculate is then the same as Mahalanobis-distance
in the metric of the total sum-of-products matrix, which is not appropriate.

I have programmed three algorithms in APL [I.B.M. (1969); Iverson
(1962)] -- the algorithm developed here, in which at any stage the
distance is in the metric of the tentatively estimated covariance matrix,
an algorithm in which Euclidean distance is used at each stage, and an
algorithm in which at each stage the distance is in the metric of the
total covariance matrix. Results of two runs of each of the three
algorithms on the Fisher iris data will be given here. In one run, the
initial centroids (initial estimates of the three mean vectors) were
flowers in the same species ("difficult initial centroids"). In another
run, the initial centroids were three flowers from the three different
species.


RESULTS OF TWO RUNS OF EACH OF THREE ALGORITHMS

                              Difficult initial centroids    Easy initial centroids
                              ---------------------------    ---------------------------
Metric                        Number of      Number of       Number of      Number of
                              misclassi-     iterations      misclassi-     iterations
                              fications      before          fications      before
                                             convergence                    convergence

The adaptive metric of the
algorithm, starting with
Σ = I (Euclidean distance)         6             14               3              5

Euclidean distance                16             11              16              3

Distance in the metric of
the total sum-of-products
matrix                            40             10              29              6


Relationship with the "k-means" procedure. Arranging the computation
a little differently, updating the estimates of the $\mu$'s and $\Sigma$ after
each individual is assigned rather than waiting until all individuals
have been assigned, produces a Mahalanobis-distance version of MacQueen's
(1966) k-means procedure.

Thus, a link has been established between some of the better known
ad hoc clustering procedures and a probability model for the clustering
problem.

4.2. Multivariate normal populations with different covariance matrices.

The algorithm generated for this case turns out not to be simply
to use a different Mahalanobis distance for each cluster. The
complication which occurs is analogous to that in "classical" classification
(discriminant analysis), where one is led to quadratic discriminant
functions if the covariance matrices differ.

The details are as follows. The likelihood in this case is

$$ (2\pi)^{-np/2} \prod_{i=1}^{n} \prod_{g=1}^{k} |\Sigma_g|^{-\theta_{gi}/2} \exp\Big[ -\tfrac{1}{2} \sum_{i=1}^{n} \sum_{g=1}^{k} \theta_{gi} \, (x_i - \mu_g)' \, \Sigma_g^{-1} (x_i - \mu_g) \Big]. $$

In this case (3.1) becomes

$$
\hat\theta_{gi} =
\begin{cases}
1 & \text{if setting } \ell = g \text{ maximizes }
      |\hat\Sigma_\ell|^{-1/2} \exp\big[ -\tfrac{1}{2} (x_i - \hat\mu_\ell)' \, \hat\Sigma_\ell^{-1} (x_i - \hat\mu_\ell) \big], \\
0 & \text{otherwise.}
\end{cases}
\tag{4.2}
$$

Maximizing the expression in (4.2) is equivalent to minimizing

$$ \ln |\hat\Sigma_\ell| + (x_i - \hat\mu_\ell)' \, \hat\Sigma_\ell^{-1} (x_i - \hat\mu_\ell). $$
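
To see why this rule is not simply a per-cluster Mahalanobis distance, consider the univariate special case (p = 1), where the quantity to be minimized reduces to $\ln \sigma_\ell^2 + (x - \mu_\ell)^2 / \sigma_\ell^2$. The following Python sketch, with hypothetical parameter values and function names, compares the two rules:

```python
import math

def criterion(x, mean, var):
    """ln(var) + (x - mean)^2 / var: the quantity minimized by (4.2) when p = 1."""
    return math.log(var) + (x - mean) ** 2 / var

def assign(x, params):
    """Assignment by rule (4.2); params holds (mean, variance) pairs."""
    return min(range(len(params)), key=lambda g: criterion(x, *params[g]))

def assign_mahalanobis_only(x, params):
    """Assignment that ignores the log-determinant term."""
    return min(range(len(params)),
               key=lambda g: (x - params[g][0]) ** 2 / params[g][1])

# Hypothetical estimates: equal means, very different variances.
params = [(0.0, 1.0), (0.0, 100.0)]
```

At x = 2 the rule (4.2) chooses the first (tight) cluster, since 4.0 is less than ln 100 + 0.04, whereas the per-cluster Mahalanobis distance alone would choose the second (diffuse) cluster. At x = 5 both rules choose the second cluster.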

It has been noted [see, e.g., Day (1969)] that in the standard
mixture model for this case the supremum of the likelihood is infinity. This
is reflected in the fact that in our algorithm it would be possible that
at some stage one of the clusters would consist of a single individual,
so that the tentative estimate of the mean of that cluster would be the
vector of observations for that individual, and the tentative estimate of
the covariance matrix of that cluster would be undefined. It is also
possible for the observations in a given cluster to be very close to
lying on a lower-dimensional subspace, so that the tentative estimate of
the covariance matrix could have an arbitrarily small determinant, and
the maximized likelihood could be arbitrarily large, for the contribution
of Cluster g to the maximized likelihood is $|S_g|^{-n_g/2} \exp(-p n_g / 2)$, where
$n_g$ is the number of individuals assigned to that cluster.

4.3. Multinomial models.

Multinomial models are of special interest because they relate to
the analysis of questionnaires and of patterns of medical symptoms.

Suppose each variable $X_{vi}$, $v = 1, \ldots, p$, is a dichotomous variable
(indicating a Yes or No answer, or presence or absence of a symptom). It is
not reasonable to assume the X's independent in the whole (mixture)
population. It is sometimes assumed, however, that within subpopulations
they are independent. This model of local independence is employed in
latent structure analysis [Lazarsfeld and Henry (1968)]. If we let

$$ \beta_{vg} = \Pr(X_{vi} = 1) = 1 - \Pr(X_{vi} = 0) $$

for the g-th subpopulation, then under the assumption of local independence
the density in the g-th subpopulation is

$$ h_g(x_i; \beta_g) = \prod_{v=1}^{p} \beta_{vg}^{\,x_{vi}} \, (1 - \beta_{vg})^{1 - x_{vi}}. $$

The "clusters" are the subpopulations.
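
As an illustration of the local-independence density and of the allocation step (3.1) for dichotomous data, here is a small Python sketch. The response probabilities in `betas` and the function names are hypothetical.

```python
def local_independence_density(x, beta):
    """Product over items v of beta_v^{x_v} * (1 - beta_v)^{1 - x_v},
    for a vector x of 0/1 responses."""
    density = 1.0
    for xv, bv in zip(x, beta):
        density *= bv if xv == 1 else (1.0 - bv)
    return density

def assign(x, betas):
    """Rule (3.1): allocate x to the subpopulation with the largest density."""
    return max(range(len(betas)),
               key=lambda g: local_independence_density(x, betas[g]))

# Hypothetical Pr(X_v = 1) for p = 3 items in k = 2 subpopulations.
betas = [(0.9, 0.2, 0.8), (0.1, 0.7, 0.3)]
```

For the response pattern (1, 0, 1) the density is 0.9 x 0.8 x 0.8 = 0.576 in the first subpopulation and 0.1 x 0.3 x 0.3 = 0.009 in the second, so the individual is allocated to the first.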


5. Comparison with the method based on the standard mixture model.

Wolfe (1970) has considered clustering based on the standard
mixture model. Under that model, the posterior probability that
Individual i belongs to Group g is

$$ \frac{\pi_g \, h_g(x_i)}{\sum_{\ell=1}^{k} \pi_\ell \, h_\ell(x_i)}. \tag{5.1} $$

If we can obtain estimates for $\beta_g$, $\pi_g$, $g = 1, \ldots, k$, they can be
substituted to provide an estimate of (5.1),

$$ \frac{\hat\pi_g \, h(x_i; \hat\beta_g)}{\sum_{\ell=1}^{k} \hat\pi_\ell \, h(x_i; \hat\beta_\ell)}. \tag{5.2} $$

Individual i is assigned to that Group g for which the estimated
posterior probability of group membership (5.2) is largest. (Recall
that, with the conditional mixture model, Individual i is assigned to
that Group g for which the estimated density $h(x_i; \hat\beta_g)$ is largest.)

Wolfe has provided computer programs for the case of normal
distributions. As is well known, the maximum likelihood equations for
mixture problems are messy. Wolfe solves them by a multivariate
Newton-Raphson method. This involves the assignment of arbitrary initial values
to the parameters, to start the iterative solution, as does the general
method described here.
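
The two allocation rules can disagree when the estimated mixing proportions are unequal. A minimal Python sketch, using hypothetical estimates `pi_hat` and hypothetical density values `h_hat` at a single observation, shows such a case:

```python
def posterior_probabilities(pi, h_values):
    """Estimated posterior probabilities (5.2) for one observation."""
    weighted = [p * h for p, h in zip(pi, h_values)]
    total = sum(weighted)
    return [w / total for w in weighted]

def argmax(values):
    """Index of the largest entry (the first one in case of ties)."""
    return max(range(len(values)), key=lambda g: values[g])

pi_hat = [0.9, 0.1]   # hypothetical estimated mixing proportions
h_hat = [0.2, 0.5]    # hypothetical estimated densities h(x_i; beta_g)
posterior = posterior_probabilities(pi_hat, h_hat)
```

Here the posterior probabilities are proportional to 0.18 and 0.05, so the rule based on (5.2) selects the first group, while the rule based on the largest estimated density selects the second. When all the estimated proportions are equal, the two rules coincide.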

Perhaps a word may be said by way of further comparison of the
standard and conditional mixture models. The likelihood in the standard
model is

$$ \prod_{i=1}^{n} f_{X_i}(x_i), $$

or

$$ \prod_{i=1}^{n} \int f_{X_i \mid \Theta_i}(x_i \mid \theta_i) \, dF_{\Theta_i}(\theta_i), $$

whereas the likelihood in the conditional model is

$$ \prod_{i=1}^{n} f_{X_i \mid \Theta_i}(x_i \mid \theta_i), $$

so that in using the conditional model we are using the factors
$f_{X_i \mid \Theta_i}(x_i \mid \theta_i)$ rather than a smoothed version of them, namely

$$ \int f_{X_i \mid \Theta_i}(x_i \mid \theta_i) \, dF_{\Theta_i}(\theta_i)
   = E_{\Theta_i}\big[ f_{X_i \mid \Theta_i}(x_i \mid \Theta_i) \big]
   = f_{X_i}(x_i). $$

Note that

$$
\begin{aligned}
\max_{\theta_1, \ldots, \theta_n} \prod_{i=1}^{n} f_{X_i \mid \Theta_i}(x_i \mid \theta_i)
  &= \max_{\theta_1, \ldots, \theta_n} \prod_{i=1}^{n} \prod_{g=1}^{k} [h_g(x_i)]^{\theta_{gi}} \\
  &= \prod_{i=1}^{n} \max_{\theta_i} \prod_{g=1}^{k} [h_g(x_i)]^{\theta_{gi}} \\
  &= \prod_{i=1}^{n} \max_{\theta_i} f_{X_i \mid \Theta_i}(x_i \mid \theta_i) \\
  &\ge \prod_{i=1}^{n} f_{X_i}(x_i) \\
  &= \prod_{i=1}^{n} f(x_i; \pi_1, \ldots, \pi_k),
\end{aligned}
$$

no matter what the values of $\pi_1, \ldots, \pi_k$. Thus

$$
\max_{\theta_1, \ldots, \theta_n} \prod_{i=1}^{n} f_{X_i \mid \Theta_i}(x_i \mid \theta_i)
  \ge \max_{\pi_1, \ldots, \pi_k} \prod_{i=1}^{n} f(x_i; \pi_1, \ldots, \pi_k)
  = \max_{\pi_1, \ldots, \pi_k} L'(\pi_1, \ldots, \pi_k; x_1, \ldots, x_n),
$$

where $L'(\pi_1, \ldots, \pi_k; x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \pi_1, \ldots, \pi_k)$ denotes the
likelihood corresponding to the standard model. If it is legitimate to
compare likelihoods under the two different models, this shows how
"overfit" occurs when we use conditional models; the same concepts apply
to the "shrinkage problem" in regression analysis when we predict using
an estimated regression function.
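
The inequality can be checked numerically for fixed component densities. The Python sketch below does so for hypothetical univariate normal components and data; the function names are illustrative. It works on the log scale, where the maximized conditional likelihood is the sum over i of the logarithm of the largest $h_g(x_i)$.

```python
import math

def normal_pdf(x, mean, sd):
    """Univariate normal density, used here as an illustrative h_g."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def conditional_max_loglik(xs, components):
    """Log of the conditional likelihood maximized over theta_1, ..., theta_n."""
    return sum(math.log(max(normal_pdf(x, m, s) for m, s in components))
               for x in xs)

def mixture_loglik(xs, pi, components):
    """Log-likelihood of the standard mixture model for given proportions pi."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, s) for p, (m, s) in zip(pi, components)))
        for x in xs
    )

xs = [-0.3, 0.4, 1.1, 3.6, 4.2, 5.0]      # hypothetical observations
components = [(0.0, 1.0), (4.0, 1.0)]     # hypothetical (mean, sd) pairs
upper = conditional_max_loglik(xs, components)
```

Whatever proportions are supplied, `mixture_loglik(xs, pi, components)` does not exceed `upper`, because a weighted average of the values $h_g(x_i)$ cannot exceed the largest of them.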

Note that, under the assumption of random sampling from the k
populations, the $\pi_g$'s of the standard model can be estimated after
clustering based on the conditional model; we can take as the estimate
the proportion of individuals assigned to Population g:

$$\hat{\pi}_g = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{gi} = \frac{n_g}{n},$$

where $n_g = \sum_{i=1}^{n} \hat{\theta}_{gi}$ is simply the number of individuals assigned to
Population g. That is, under an assumption of random sampling, we can
use results obtained from working with the conditional distribution of
X to estimate parameters of the marginal distribution of X.
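
As a small illustration of this estimate, the sketch below computes the proportion of individuals assigned to each population from a list of cluster labels. The function name and the label vector are hypothetical; labels are assumed to be integers 1 through k.

```python
from collections import Counter

def estimate_mixing_proportions(labels, k):
    """Return [n_g / n for g = 1..k], where n_g is the number of
    observations assigned to Population g."""
    n = len(labels)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = Counter(labels)
    return [counts.get(g, 0) / n for g in range(1, k + 1)]

labels = [1, 1, 2, 3, 2, 1, 3, 1]  # hypothetical cluster assignments
pi_hat = estimate_mixing_proportions(labels, k=3)
print(pi_hat)  # [0.5, 0.25, 0.25]
```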


These two types of models, conditional and unconditional, arise in
other statistical contexts as well, notably analysis of variance
[Eisenhart's (1947) classification of effects as "Model I" or "Model II" is
now standard] and factor analysis [see Anderson and Rubin (1956)].
6. Some remarks on statistical inference.


Again let $L(B,T)$ denote the likelihood as a function of the
structural parameters B and the incidental parameters T, given the
data. The maximum likelihood estimate of (B,T) is the value $(\hat{B},\hat{T})$
for which L is largest. The quantity $L(\hat{B},\hat{T})$ is the corresponding
maximum value of the likelihood. To approximate $(\hat{B},\hat{T})$, one uses the
algorithm. Let

$$\lambda(B,T) = L(B,T) / L(\hat{B},\hat{T}).$$

Let F denote the asymptotic (as n tends to infinity) cumulative
distribution function of $-2 \ln \lambda(B,T)$:

$$\lim_{n \to \infty} \Pr\{-2 \ln \lambda(B,T) \le x\} = F(x).$$

Suppose that F is independent of (B,T). For example, it may be the
cumulative distribution function of a chi-square distribution with an
appropriate number of degrees of freedom; it is necessary to investigate
the extent to which the large sample theory of the generalized likelihood
ratio applies when there are incidental parameters.

6.1. Confidence sets.

Let $x_\alpha$ denote the upper-$\alpha$ percentage point of F. Then

$$1 - \alpha = F(x_\alpha) = \Pr\{-2 \ln \lambda(B,T) \le x_\alpha\} = \Pr\{-2 \ln L(B,T) \le x_\alpha - 2 \ln L(\hat{B},\hat{T})\},$$

so that

$$\{(B,T): -2 \ln L(B,T) \le x_\alpha - 2 \ln L(\hat{B},\hat{T})\}$$

is an approximate 100(1 - α)% confidence set for (B,T).

Denote by $(\tilde{B},\tilde{T})$ the estimates produced by the algorithm. Then
$L(\tilde{B},\tilde{T}) \le L(\hat{B},\hat{T})$. Thus a conservative confidence set (one that contains
more values of (B,T) than the true confidence set and has confidence
coefficient at least 1 - α) is

$$\{(B,T): -2 \ln L(B,T) \le x_\alpha - 2 \ln L(\tilde{B},\tilde{T})\}.$$
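
Membership in such a set is easy to test once log likelihoods can be evaluated. The sketch below writes the condition in likelihood-ratio form: a candidate is retained when twice the gap between the log likelihood at the algorithm's estimate and the log likelihood at the candidate does not exceed the percentage point. For illustration only, it assumes a toy model with a single structural parameter (the mean of a unit-variance normal sample), no incidental parameters, and the approximate upper 5% point 3.841 of the chi-square distribution with one degree of freedom. All names and data are hypothetical.

```python
import math

def normal_loglik(xs, mu):
    """Log likelihood of a N(mu, 1) sample (toy stand-in for ln L(B, T))."""
    n = len(xs)
    return -0.5 * n * math.log(2.0 * math.pi) - 0.5 * sum((x - mu) ** 2 for x in xs)

def in_conservative_set(loglik_candidate, loglik_algorithm, x_alpha):
    """True when 2 * [ln L(algorithm estimate) - ln L(candidate)] <= x_alpha."""
    return 2.0 * (loglik_algorithm - loglik_candidate) <= x_alpha

xs = [0.8, 1.9, 2.4, 3.1, 1.8]  # hypothetical data
mu_algorithm = sum(xs) / len(xs)  # estimate returned by the (toy) algorithm
X_ALPHA = 3.841                   # approx. upper 5% point of chi-square, 1 d.f.

ll_alg = normal_loglik(xs, mu_algorithm)
inside = in_conservative_set(normal_loglik(xs, 2.5), ll_alg, X_ALPHA)
outside = in_conservative_set(normal_loglik(xs, 3.0), ll_alg, X_ALPHA)
print(inside, outside)
```

In this toy case the statistic reduces to n times the squared distance between the candidate mean and the sample mean, so the candidate 2.5 is retained and the candidate 3.0 is excluded.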


6.2. Some remarks on choice of k.

The algorithm can be run with different choices of k and the
results can be compared. Note that the likelihood function is a different
function for different values of k. Denote this dependence upon k by
denoting the likelihood by $L_k(B_k,T_k)$. Let $\hat{B}_k$, $\hat{T}_k$ denote the maximum
likelihood estimates. Following Wolfe's approach for the standard mixture
model, one might make a sequence of hypothesis tests to decide on k,
first comparing $L_1(\hat{B}_1,\hat{T}_1)$ with $L_2(\hat{B}_2,\hat{T}_2)$, then if necessary comparing
$L_2(\hat{B}_2,\hat{T}_2)$ with $L_3(\hat{B}_3,\hat{T}_3)$, etc. Wolfe uses the asymptotic chi-square
distribution of the generalized likelihood ratio here; even in the context
of the standard mixture model this may not be the asymptotic distribution.
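
The sequential comparison can be sketched as follows. This is only an illustration of the bookkeeping: the maximized log likelihoods and the critical values are hypothetical inputs, and, as noted above, chi-square critical values may not be justified for this statistic.

```python
def choose_k(max_logliks, critical_values):
    """Sequentially compare k with k + 1 clusters.

    max_logliks[j] is the maximized log likelihood for k = j + 1 clusters;
    critical_values[j] is the cutoff used when testing k = j + 1 against
    k = j + 2. Returns the first k that is not rejected.
    """
    for j, crit in enumerate(critical_values):
        # -2 ln of the ratio of the maximized likelihoods for k and k + 1.
        stat = -2.0 * (max_logliks[j] - max_logliks[j + 1])
        if stat <= crit:
            return j + 1
    return len(max_logliks)

max_logliks = [-120.0, -95.0, -93.5, -93.0]  # hypothetical values for k = 1..4
critical_values = [6.0, 6.0, 6.0]            # hypothetical cutoffs
k_chosen = choose_k(max_logliks, critical_values)
print(k_chosen)  # 2
```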

An alternative approach to the choice of k is to follow a suggestion
of MacQueen and introduce refinement and coarsening parameters R and C
such that two clusters coalesce when their centroids are less than R units
apart and a cluster splits when its diameter (maximum distance between
any two of its members) exceeds C.
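
The two rules just described can be sketched as simple checks on a current partition. The code below only identifies which clusters would coalesce or split under given values of R and C, using Euclidean distance; how a cluster is actually split, and the example data, are left unspecified or hypothetical here.

```python
import math
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def diameter(points):
    """Maximum distance between any two members (0 for a single point)."""
    return max((math.dist(p, q) for p, q in combinations(points, 2)), default=0.0)

def pairs_to_coalesce(clusters, R):
    """Index pairs of clusters whose centroids are less than R units apart."""
    cents = [centroid(c) for c in clusters]
    return [(a, b) for a, b in combinations(range(len(clusters)), 2)
            if math.dist(cents[a], cents[b]) < R]

def clusters_to_split(clusters, C):
    """Indices of clusters whose diameter exceeds C."""
    return [i for i, c in enumerate(clusters) if diameter(c) > C]

clusters = [  # hypothetical two-dimensional clusters
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.5, 1.0), (1.5, 1.0)],
    [(10.0, 10.0), (16.0, 18.0)],
]
print(pairs_to_coalesce(clusters, R=2.0))  # [(0, 1)]
print(clusters_to_split(clusters, C=5.0))  # [2]
```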

7. Conclusions.

A modification of the usual mixture model has been employed to
provide a probability framework for clustering problems. A general
method of producing clustering algorithms which correspond to a method
of iterated maximum likelihood has been given. The general method given here
is a plausible method for clustering which is linked to a probability model
and which is comparatively easy to program. In the case of multivariate
normal distributions with common covariance matrix the general method
produces clustering schemes which can be viewed as improved versions of
some existing schemes.

The focus here has been on the parametric case, but the methods
discussed might be applied to the nonparametric case by estimating the densities
$h_g(x)$ as the clustering proceeds, using standard methods of density
estimation.

Clustering algorithms based on a likelihood function are based on
the raw data matrix, in contradistinction to many clustering procedures
which are based on a matrix of pairwise similarities or distances. The latter
procedures have the advantage of applicability to problems where a raw
data matrix is not available. When the raw data are available, such
algorithms have the disadvantage of not extracting all the information
from the observations and the computational disadvantage of preliminary
computation of all the pairwise distances (or similarities).




Acknowledgements. This research has been supported by National Science
Foundation Grants GP-22595 at Carnegie-Mellon University and GP
at Stanford University and Office of Naval Research Contract N00014-
67-A-0112-0030 (NR-042-034) at Stanford University.




REFERENCES 


[1] Anderson, T. W. (1958). An Introduction to Multivariate Statistical
Analysis. John Wiley and Sons, Inc., New York.

[2] Anderson, T. W. (1959). Some scaling models and estimation procedures
in the latent class model. Probability and Statistics:
The Harald Cramér Volume, U. Grenander, ed., 9-38. Almqvist
and Wiksell, Uppsala.

[3] Anderson, T. W., and Rubin, Herman. (1956). Statistical inference
in factor analysis. Proc. Third Berkeley Symposium Math.
Statist. and Prob., J. Neyman, ed., 5, 111-150. University
of California Press, Berkeley and Los Angeles.

[4] Ball, G. H., and Hall, David J. (1967). A clustering technique for
summarizing multivariate data. Behavioral Science 12, 153-155.

[5] Chernoff, Herman. (1970). Metric considerations in cluster analysis.
Proc. Sixth Berkeley Symposium Math. Statist. and Prob. 1,
621-629.

[6] Day, N. E. (1969). Estimating the components of a mixture of
normal distributions. Biometrika 56, 463-475.

[7] Eisenhart, C. (1947). The assumptions underlying the A.O.V. Biometrics
3, 1-21.

[8] Fisher, R. A. (1936). The use of multiple measurements in taxonomic
problems. Ann. Eugen. 7, 179-188.

[9] Fleiss, J. L., and Zubin, J. (1969). On the methods and theory of
clustering. Multivariate Behavioral Research 4, 235-250.

[10] Gibson, W. A. (1959). Three multivariate models: factor analysis,
latent structure analysis, and latent profile analysis.
Psychometrika 24, 229-252.




[11] International Business Machines Corp. (1969). APL\360 Primer, 2nd
ed. (IBM Publication GH20-0689-1). IBM Corporation, Technical
Publications Dept., White Plains, New York.

[12] Iverson, Kenneth E. (1962). A Programming Language. John Wiley and
Sons, Inc., New York.

[13] John, S. (1970). On identifying the population of origin of each
observation in a mixture of observations from two normal
populations. Technometrics 12, 553-563.

[14] Lazarsfeld, Paul F., and Henry, Neil W. (1968). Latent Structure
Analysis. Houghton Mifflin Co., Boston.

[15] MacQueen, J. (1966). Some methods for classification and analysis
of multivariate observations. Proc. Fifth Berkeley Symposium
Math. Statist. and Prob. 1, 281-297.

[16] Neyman, J., and Scott, E. L. (1948). Consistent estimates based
on partially consistent observations, with particular
reference to structural relations. Econometrica 16, 1-32.

[17] Ortega, James, and Rheinboldt, Werner. (1970). Iterative Solution
of Nonlinear Equations in Several Variables. Academic Press,
New York.

[18] Robbins, Herbert. (1964). The empirical Bayes approach to
statistical decision problems. Ann. Math. Statist. 35,
1-20.

[19] Scott, A. J., and Symons, M. J. (1971). Clustering methods based
on likelihood ratio criteria. Biometrics 27, 387-397.

[20] Southwell, R. (1940). Relaxation Methods in Engineering Science:
A Treatise on Approximate Computation. Oxford University
Press, London.

[21] Southwell, R. (1946). Relaxation Methods in Theoretical Physics.
Oxford University Press (Clarendon), London and New York.

[22] Wolfe, John H. (1970). Pattern clustering by multivariate mixture
analysis. Multivariate Behavioral Research 5, 329-350.




UNCLASSIFIED
Security Classification

4. DESCRIPTIVE NOTES (Type of report and inclusive dates): TECHNICAL REPORT

5. AUTHOR(S) (Last name, first name, initial): SCLOVE, Stanley L.

6. REPORT DATE: February 1, 1973

7a. TOTAL NO. OF PAGES: 28

7b. NO. OF REFS: 22

8a. CONTRACT OR GRANT NO.: N00014-67-A-0112-0030

8b. PROJECT NO.: NR-042-034

9a. ORIGINATOR'S REPORT NUMBER(S): #11

9b. OTHER REPORT NO(S) (Any other numbers that may be assigned this report): #71 NSF GP-32326X

10. AVAILABILITY/LIMITATION NOTICES: Reproduction in whole or in part is permitted for any purpose of the
United States Government.

12. SPONSORING MILITARY ACTIVITY: Office of Naval Research, Arlington, Va.




